PCNN: Projection Convolutional Neural Networks
\[
\frac{\partial L_P}{\partial C_i^l} = \lambda \sum_j^J \left\{ \left[ W_j^l \circ \left( C_i^l + \eta \delta_{\hat{C}_{i,j}^l} \right) - \hat{C}_{i,j}^l \right] \circ W_j^l \right\},
\tag{3.52}
\]
where $\mathbf{1}$ is the indicator function [199], widely used to estimate the gradient of a nondifferentiable function; its output is 1 only if the condition is satisfied and 0 otherwise.

Updating $W_j^l$: Likewise, the gradient of the projection parameter, $\delta_{W_j^l}$, consists of the following two parts:
\[
\delta_{W_j^l} = \frac{\partial L}{\partial W_j^l} = \frac{\partial L_S}{\partial W_j^l} + \frac{\partial L_P}{\partial W_j^l},
\tag{3.53}
\]
\[
W_j^l \leftarrow W_j^l - \eta_2 \delta_{W_j^l},
\tag{3.54}
\]
where $\eta_2$ is the learning rate for $W_j^l$. We also have the following:
\[
\frac{\partial L_S}{\partial W_j^l}
= \sum_h^J \frac{\partial L_S}{\partial \left[ W_j^l \right]_h}
= \sum_h^J \sum_i^I \left[ \frac{\partial L_S}{\partial \hat{C}_{i,j}^l}
\frac{\partial P_{\Omega_N}^{l,j} ( W_j^l, C_i^l )}{\partial ( W_j^l \circ C_i^l )}
\frac{\partial ( W_j^l \circ C_i^l )}{\partial \left[ W_j^l \right]_h} \right]
= \sum_h^J \sum_i^I \left[ \frac{\partial L_S}{\partial \hat{C}_{i,j}^l}
\circ \mathbf{1}_{-1 \le W_j^l \circ C_i^l \le 1} \circ C_i^l \right]_h,
\tag{3.55}
\]
\[
\frac{\partial L_P}{\partial W_j^l}
= \lambda \sum_h^J \sum_i^I \left[ \left( W_j^l \circ \left( C_i^l + \eta \delta_{\hat{C}_{i,j}^l} \right) - \hat{C}_{i,j}^l \right) \circ \left( C_i^l + \eta \delta_{\hat{C}_{i,j}^l} \right) \right]_h,
\tag{3.56}
\]
where $h$ indicates the $h$th plane of the tensor along the channels. This shows that the proposed algorithm can be trained end to end; we summarize the training procedure in Algorithm 13. In the implementation, we use the mean of $W$ in the forward process but keep the original $W$ in the backward propagation.
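The indicator-function gradient of Eq. (3.55) and the mean-of-$W$ treatment can be sketched roughly as follows. This is not the authors' code: the one-dimensional tensors, the upstream gradient, and the function name are placeholder assumptions standing in for one plane of $W_j^l$ and $C_i^l$.

```python
import numpy as np

def ste_indicator_grad(grad_output, w, c):
    # Straight-through-style backward pass through the projection:
    # the gradient is kept only where -1 <= W o C <= 1 (the indicator
    # function in Eq. (3.55)) and zeroed elsewhere; the surviving
    # entries are scaled by C, the other factor of the product W o C.
    mask = (np.abs(w * c) <= 1.0).astype(grad_output.dtype)
    return grad_output * mask * c  # gradient w.r.t. W

# Hypothetical values for one plane of W_j^l and C_i^l.
w = np.array([0.5, 2.0, -1.5])
c = np.array([0.2, 0.1, 0.9])

# Forward pass uses the mean of W; backward keeps the original W.
w_forward = np.full_like(w, w.mean())
grad_out = np.ones_like(w)  # hypothetical upstream gradient
grad_w = ste_indicator_grad(grad_out, w, c)
```

Here the third entry of `grad_w` is zeroed because $|W \circ C| > 1$ there, while `w_forward` carries the layer mean used only in the forward computation.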
Note that in PCNNs for BNNs, we set $U = 2$ and $a_2 = -a_1$. Two binarization processes are used in PCNNs. The first is kernel binarization, performed via the projection onto $\Omega_N$, whose elements are calculated from the mean absolute value of all full-precision kernels per layer [199] as
\[
\frac{1}{I} \sum_i^I \left\| C_i^l \right\|_1,
\tag{3.57}
\]
where $I$ is the total number of kernels.
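A minimal sketch of Eq. (3.57) as written, i.e., the mean over the $I$ kernels of their $L_1$ norms; the function name and kernel values are hypothetical:

```python
import numpy as np

def omega_elements_scale(kernels):
    # Mean of the L1 norms of all I full-precision kernels in a layer,
    # used to set the elements of the discrete set Omega_N (Eq. (3.57)).
    return np.mean([np.abs(k).sum() for k in kernels])

# Two hypothetical 2x2 full-precision kernels (I = 2).
kernels = [np.array([[1.0, -1.0], [0.5, 0.5]]),
           np.array([[2.0, 0.0], [-1.0, 1.0]])]
scale = omega_elements_scale(kernels)  # (3.0 + 4.0) / 2 = 3.5
```

The resulting scalar is computed per layer and shared by all kernels of that layer.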
3.5.7 Progressive Optimization
Training 1-bit CNNs is a highly nonconvex optimization problem, and the initialization state significantly affects convergence. Unlike the method in [159], which initializes the 1-bit CNN models from a real-valued CNN model with the clip function pre-trained on ImageNet, we propose a progressive optimization strategy for training 1-bit CNNs. Although a real-valued CNN model can achieve high classification accuracy, its converged state can differ from that of a 1-bit CNN and may therefore mislead the convergence process of 1-bit CNNs.